Predicting Food Security and Amount of Pee Breaks per Night

Author

Max Tjen

Published

2023-05-07

R Packages and Setup

Code
knitr::opts_chunk$set(comment = NA) 

library(janitor) 
library(nhanesA)
library(naniar)
library(broom)
library(MASS)
library(rms)
library(simputation)
library(nnet)
library(pROC)
library(pscl)
library(yardstick)
library(kableExtra)
library(tidyverse)

theme_set(theme_bw()) 

1 Background

For this project, we will be using National Health and Nutrition Examination Survey (NHANES) data collected from 2017 through March 2020. The datasets used are demographics, alcohol use, diabetes, food security, income, and kidney conditions.

Our first outcome of interest is food security, which describes an adult's food security category over the past 12 months. It's divided into four categories - very low, low, marginal, and full - coded from survey results. Full food security means that a subject scored 0 and has “had no problems, or anxiety about, consistently accessing adequate food”. Marginal security means that a subject scored 1-2 and has “had problems at times, or anxiety about, accessing adequate food, but the quality, variety, and quantity of their food intake were not substantially reduced”. Low security means that a subject scored 3-5 and has “reduced the quality, variety, and desirability of their diets, but the quantity of food intake and normal eating patterns were not substantially disrupted”. Lastly, very low security means that a subject scored 6-10 and at “times during the year, eating patterns of one or more household members were disrupted and food intake reduced because the household lacked money and other resources for food”. All quotes are courtesy of the United States Department of Agriculture. This interests me because food security is such a large issue in the United States, despite rarely being recognized as one. With that in mind, I think it's important to see whether there's a way to predict someone's food security and identify those who may need help putting food on their plate.
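
The USDA scoring rule above is easy to make concrete in code. NHANES already supplies the coded category (FSDAD), so the following is only an illustration of the cutoffs:

```r
# map a raw 0-10 adult food security score to its USDA category
food_category <- function(score) {
  as.character(cut(score,
                   breaks = c(-1, 0, 2, 5, 10),  # (-1,0], (0,2], (2,5], (5,10]
                   labels = c("Full", "Marginal", "Low", "Very Low")))
}

food_category(c(0, 2, 4, 8))
# "Full" "Marginal" "Low" "Very Low"
```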

Our second outcome of interest is how many times someone has to urinate at night. This describes the average number of times a subject has had to urinate between going to bed and waking up, per night, over the past 30 days. I find this interesting because of how disruptive these trips can be to sleep cycles and overall sleep quality. Additionally, you often hear that these occurrences increase with age, so I'd like to see whether the data agrees with that assumption.

2 Research Questions

How well can we predict an adult’s food security category using ethnicity, poverty index, and systolic blood pressure as predictors?

How well can we predict the average number of times someone has to pee at night using age, alcohol intake, and systolic blood pressure as predictors?

3 Data Ingest and Management

3.1 Loading the Raw Data and Variable Selection

Code
# load raw data
demo <- nhanes('P_DEMO') |> tibble() |> 
  select(SEQN, RIDAGEYR, RIDRETH3) |> clean_names()
alcohol <- nhanes('P_ALQ') |> tibble() |> 
  select(SEQN, ALQ130) |> clean_names()
diabetes <- nhanes('P_DIQ') |> tibble() |> 
  select(SEQN, DIQ300S) |> clean_names()
food <- nhanes('P_FSQ') |> tibble() |> 
  select(SEQN, FSDAD) |> clean_names()
income <- nhanes('P_INQ') |> tibble() |> 
  select(SEQN, INDFMMPI) |> clean_names()
kidney <- nhanes('P_KIQ_U') |> tibble() |> 
  select(SEQN, KIQ480) |> clean_names()

# join data sets
data <- left_join(demo, alcohol, by = "seqn")
data <- left_join(diabetes, data, by = "seqn")
data <- left_join(food, data, by = "seqn")
data <- left_join(income, data, by = "seqn")
data <- left_join(kidney, data, by = "seqn")

We begin by ingesting the raw NHANES datasets via the nhanesA package. At the same time, we select the variables we will be using and join the individual datasets into one, matching on each subject's seqn value.

3.2 Cleaning the Data

3.2.1 Changing Variable Names and Converting Variable Types

Code
data <- data |>
  mutate(id = as.character(seqn),
         pee = kiq480,
         pov_index = indfmmpi,
         food = as.factor(fsdad),
         sbp = diq300s,
         age = ridageyr,
         ethnicity = as.factor(ridreth3),
         avg_drinks = alq130) |>
  select(-seqn, -kiq480, -indfmmpi, -fsdad, -diq300s, 
         -ridageyr, -ridreth3, -alq130)

Here, we change some variable types and their names to be more descriptive.

3.2.2 Sampling the Data

Our data contains some values that we need to filter out. First, we want to restrict our data to adults, so we will filter on age. Next, pee, sbp, and avg_drinks contain special codes indicating that a subject either did not know the value or refused to answer, so we will filter those out as well.

Code
data <- data |>
  filter(age >= 21) |> 
  filter(age < 80) |>
  filter(pee != 9 | is.na(pee)) |>
  filter(sbp != 7777 | is.na(sbp)) |>
  filter(sbp != 9999 | is.na(sbp)) |>
  filter(avg_drinks != 777 | is.na(avg_drinks)) |>
  filter(avg_drinks != 999 | is.na(avg_drinks))
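
An equivalent, more compact way to express the sentinel-code filtering above is a small helper that drops the "refused"/"don't know" codes while keeping genuine NAs. The helper name drop_codes is hypothetical, a sketch rather than part of the original pipeline:

```r
library(dplyr)

# drop rows where a column equals one of the sentinel codes, keeping real NAs
drop_codes <- function(data, col, codes) {
  filter(data, !({{ col }} %in% codes) | is.na({{ col }}))
}

# example on toy data with a 999 "don't know" code
toy <- data.frame(avg_drinks = c(1, 999, 3, NA))
drop_codes(toy, avg_drinks, c(777, 999))$avg_drinks
# 1 3 NA
```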

3.2.3 Working with Categorical Predictors

Code
data <- data |>
  mutate(food_cat = factor(case_when(food == "1" ~ "Full",
                                     food == "2" ~ "Marginal",
                                     food == "3" ~ "Low",
                                     food == "4" ~ "Very Low")),
         ethnicity_cat = factor(case_when(
           ethnicity == "1" ~ "Mexican American",
           ethnicity == "2" ~ "Other Hispanic",
           ethnicity == "3" ~ "Non-Hispanic White",
           ethnicity == "4" ~ "Non-Hispanic Black",
           ethnicity == "6" ~ "Non-Hispanic Asian",
           ethnicity == "7" ~ "Other Race"))) |>
   mutate(food_cat = fct_relevel(
    food_cat, "Very Low", "Low", "Marginal", "Full"),
    food_cat = factor(food_cat, ordered = TRUE)) |>
  select(-food, -ethnicity)

Here, we make our categorical variable values more descriptive.

3.2.4 Checking Missingness

Now, we will check the missingness of the data.

Code
# filter complete cases for both outcomes
data <- data |> 
  filter(complete.cases(food_cat)) |>
  filter(complete.cases(pee))

miss_var_summary(data)
# A tibble: 8 × 3
  variable      n_miss pct_miss
  <chr>          <int>    <dbl>
1 sbp             5850     88.9
2 avg_drinks      1788     27.2
3 pov_index        981     14.9
4 id                 0      0  
5 pee                0      0  
6 age                0      0  
7 food_cat           0      0  
8 ethnicity_cat      0      0  

Because sbp has a very high percentage of missing values (88.9%), we will drop these missing observations rather than impute them. Imputation relies on the observed values, and here there would be too little observed data on which to base the estimates.

Code
data <- data |>
  filter(!is.na(sbp))

3.2.5 Arranging the Tibble

To make our tibble more readable, we will move our outcome variables to the far right.

Code
data <- data |> 
  relocate(food_cat, .after = last_col()) |>
  relocate(pee, .after = last_col())

4 Code Book and Description

4.1 Defining the Variables

The following is our code book to help define variables in our data.

Variable       Role        Type             Description
id             identifier  character        subject identifier
pov_index      input       quantitative     family monthly poverty level index
sbp            input       quantitative     most recent systolic blood pressure reading
age            input       quantitative     age of subject in years at time of screening
avg_drinks     input       quantitative     average number of alcoholic drinks per day over the past 12 months
ethnicity_cat  input       categorical (6)  race/Hispanic origin [Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Non-Hispanic Asian, Other Race]
food_cat       outcome     ordinal (4)      adult food security category over the past 12 months [Very Low, Low, Marginal, Full]
pee            outcome     count            typical number of urination trips between going to sleep and waking up over the past 30 days

4.2 Numerical Description

Next, we will look at quick numerical summaries/descriptions of our variables.

Code
Hmisc::describe(data)
data 

 8  Variables      728  Observations
--------------------------------------------------------------------------------
id 
       n  missing distinct 
     728        0      728 

lowest : 109274 109290 109292 109316 109373, highest: 124734 124741 124752 124778 124788
--------------------------------------------------------------------------------
pov_index 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     628      100      241    0.998     2.43    1.688    0.660    0.770 
     .25      .50      .75      .90      .95 
   1.180    1.995    3.650    5.000    5.000 

lowest : 0.00 0.07 0.14 0.28 0.31, highest: 4.94 4.96 4.97 4.98 5.00
--------------------------------------------------------------------------------
sbp 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     728        0       86    0.996    130.2    18.67    109.0    111.7 
     .25      .50      .75      .90      .95 
   120.0    128.0    138.0    150.3    165.2 

lowest :  58  80  85  87  88, highest: 186 197 198 200 201
--------------------------------------------------------------------------------
age 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     728        0       53    0.999    61.48    12.24       41       46 
     .25      .50      .75      .90      .95 
      54       63       70       75       76 

lowest : 22 25 28 29 30, highest: 75 76 77 78 79
--------------------------------------------------------------------------------
avg_drinks 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     421      307       10    0.876    2.181    1.598        1        1 
     .25      .50      .75      .90      .95 
       1        2        3        4        5 

lowest :  1  2  3  4  5, highest:  6  8 10 12 15
                                                                      
Value          1     2     3     4     5     6     8    10    12    15
Frequency    195   119    48    24    15    10     3     2     3     2
Proportion 0.463 0.283 0.114 0.057 0.036 0.024 0.007 0.005 0.007 0.005
--------------------------------------------------------------------------------
ethnicity_cat 
       n  missing distinct 
     728        0        6 

lowest : Mexican American   Non-Hispanic Asian Non-Hispanic Black Non-Hispanic White Other Hispanic    
highest: Non-Hispanic Asian Non-Hispanic Black Non-Hispanic White Other Hispanic     Other Race        
                                                                   
Value        Mexican American Non-Hispanic Asian Non-Hispanic Black
Frequency                  74                 82                219
Proportion              0.102              0.113              0.301
                                                                   
Value      Non-Hispanic White     Other Hispanic         Other Race
Frequency                 239                 70                 44
Proportion              0.328              0.096              0.060
--------------------------------------------------------------------------------
food_cat 
       n  missing distinct 
     728        0        4 
                                              
Value      Very Low      Low Marginal     Full
Frequency        77       99      111      441
Proportion    0.106    0.136    0.152    0.606
--------------------------------------------------------------------------------
pee 
       n  missing distinct     Info     Mean      Gmd 
     728        0        6    0.949    1.845    1.458 

lowest : 0 1 2 3 4, highest: 1 2 3 4 5
                                              
Value          0     1     2     3     4     5
Frequency    119   199   195   140    41    34
Proportion 0.163 0.273 0.268 0.192 0.056 0.047
--------------------------------------------------------------------------------

5 Analyses

5.1 Train/Test Split

We will split our data into training and testing datasets, which allows us to validate our final models on unseen data later.

Code
set.seed(12321)

# get train and test groups
train_data <- slice_sample(data, n = 500)
test_data <- anti_join(data, train_data, by = "id")

# check dimensions
dim(data)
[1] 728   8
Code
dim(train_data)
[1] 500   8
Code
dim(test_data)
[1] 228   8

5.2 Analysis 1

5.2.1 Research Question

How well can we predict an adult’s food security category in the past 12 months using ethnicity, poverty index, and systolic blood pressure as predictors?

5.2.2 Data Selection

Code
# subset data
a1_train <- train_data |>
  select(ethnicity_cat, pov_index, sbp, food_cat) 

Here, we subset our overall data for just the variables needed in analysis 1.

5.2.3 Missingness

Code
# check missingness
miss_var_summary(a1_train)
# A tibble: 4 × 3
  variable      n_miss pct_miss
  <chr>          <int>    <dbl>
1 pov_index         71     14.2
2 ethnicity_cat      0      0  
3 sbp                0      0  
4 food_cat           0      0  

After looking at our missing data summary, we see that we are missing about 14% of values for pov_index. We will account for this by using simple imputation to estimate values for those missing observations.

5.2.4 Simple Imputation

Code
# imputation - predictive mean matching
a1_train <- a1_train |>
  impute_pmm(pov_index ~ ethnicity_cat + food_cat)

miss_var_summary(a1_train)
# A tibble: 4 × 3
  variable      n_miss pct_miss
  <chr>          <int>    <dbl>
1 ethnicity_cat      0        0
2 pov_index          0        0
3 sbp                0        0
4 food_cat           0        0

Due to missingness, we must impute values for pov_index. To do so, we used predictive mean matching based on ethnicity_cat and food_cat.
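
Predictive mean matching itself is simple to sketch: fit a model on the complete cases, predict for every row, then fill each missing value with the observed value whose prediction is closest (its "donor"). A minimal base-R illustration on hypothetical toy data, not our actual imputation:

```r
# toy data: y is missing in one row
toy <- data.frame(x = c(1, 2, 3, 4.5, 5),
                  y = c(2.1, 3.9, 6.2, NA, 9.8))

# 1. fit a linear model on the complete cases
fit <- lm(y ~ x, data = toy)

# 2. predict for all rows, including the one with missing y
pred <- predict(fit, newdata = toy)

# 3. donor = observed row whose prediction is closest to the missing row's
miss <- which(is.na(toy$y))
obs  <- which(!is.na(toy$y))
donor <- obs[which.min(abs(pred[obs] - pred[miss]))]

toy$y[miss] <- toy$y[donor]  # imputed value is a real observed value (9.8)
```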

5.2.5 Outcome Distribution

Code
summary(a1_train$food_cat)
Very Low      Low Marginal     Full 
      55       61       81      303 
Code
ggplot(a1_train, aes(x = food_cat, fill = food_cat)) +
  geom_bar() +
  labs(title = "Distribution of Food Security Categories",
       x = "Food Security Category",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "BuPu")

By visualizing the distribution of our outcome variable food_cat, we can see that a lot of the subjects are within the full category. The other three categories - very low, low, and marginal - have similar counts.

5.2.6 Outcome Distribution by Predictors

5.2.6.1 Ethnicity

Code
ggplot(a1_train, aes(x = food_cat, fill = food_cat)) +
  geom_bar() +
  facet_wrap(~ ethnicity_cat) +
  labs(title = "Distribution of Food Security Categories by Ethnicity",
       x = "Food Security Category",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "BuPu")

By breaking down the data by ethnicity, we can see that the Non-Hispanic Black and Non-Hispanic White groups are particularly imbalanced, with large numbers of subjects in the full category. That said, full is the most frequent food security category within every ethnicity group.

5.2.6.2 Poverty Index

Code
ggplot(a1_train, aes(x = food_cat, y = pov_index, fill = food_cat)) +
  geom_violin() +
  geom_boxplot(width = 0.15, fill = "White") +
  labs(title = "Poverty Index by Food Security Category",
       x = "Food Security Category",
       y = "Poverty Index") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "BuPu") 

With this plot, we can see that the low and marginal food security categories have similarly right-skewed distributions and quartile values of poverty index. The very low category has a similar median to those two, although its quartile values are lower and its distribution is more right-skewed. The full category has a relatively uniform poverty index distribution, but its quartile values are much higher than those of the other three categories.

5.2.6.3 Systolic Blood Pressure

Code
ggplot(a1_train, aes(x = food_cat, y = sbp, fill = food_cat)) +
  geom_violin() +
  geom_boxplot(width = 0.15, fill = "White") +
  labs(title = "SBP by Food Security Category",
       x = "Food Security Category",
       y = "Systolic Blood Pressure") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "BuPu")

The quartile sbp values for each group are quite similar, with the very low category having a slightly higher third quartile. The distributions for very low, low, and marginal are all relatively normal, while full is right-skewed.

5.2.7 Non-Linearity

Code
summary(a1_train$food_cat)
Very Low      Low Marginal     Full 
      55       61       81      303 
Code
summary(a1_train$ethnicity_cat)
  Mexican American Non-Hispanic Asian Non-Hispanic Black Non-Hispanic White 
                49                 58                148                164 
    Other Hispanic         Other Race 
                48                 33 

Because our smallest outcome category (very low) has 55 observations, the maximum number of degrees of freedom we can spend is 6. However, our predictors currently use 7, so we will collapse the ethnicity categories to Non-Hispanic Asian, Non-Hispanic Black, Non-Hispanic White, and Other.

Code
a1_train <- a1_train |> 
  mutate(ethnicity_cat = fct_collapse(ethnicity_cat,
                                      "Other" = c("Mexican American", 
                                                  "Other Hispanic", 
                                                  "Other Race")))

summary(a1_train$ethnicity_cat)
             Other Non-Hispanic Asian Non-Hispanic Black Non-Hispanic White 
               130                 58                148                164 

5.2.8 Proportional Odds Logistic Regression

Code
# distribution
dist <- datadist(a1_train)
options(datadist = "dist")

# Create Model
mod1_lrm <- lrm(food_cat ~ ethnicity_cat + pov_index + sbp,
                data = a1_train, x = TRUE, y = TRUE)

mod1_lrm
Logistic Regression Model

lrm(formula = food_cat ~ ethnicity_cat + pov_index + sbp, data = a1_train, 
    x = TRUE, y = TRUE)


Frequencies of Responses

Very Low      Low Marginal     Full 
      55       61       81      303 

                       Model Likelihood    Discrimination    Rank Discrim.    
                             Ratio Test           Indexes          Indexes    
Obs           500    LR chi2     178.50    R2       0.338    C       0.780    
max |deriv| 1e-05    d.f.             5    R2(5,500)0.293    Dxy     0.560    
                     Pr(> chi2) <0.0001    R2(5,385)0.363    gamma   0.560    
                                           Brier    0.151    tau-a   0.325    

                                 Coef    S.E.   Wald Z Pr(>|Z|)
y>=Low                           -1.0580 0.7527 -1.41  0.1598  
y>=Marginal                      -2.0976 0.7528 -2.79  0.0053  
y>=Full                          -3.1021 0.7624 -4.07  <0.0001 
ethnicity_cat=Non-Hispanic Asian  0.7033 0.3930  1.79  0.0735  
ethnicity_cat=Non-Hispanic Black  0.1105 0.2533  0.44  0.6626  
ethnicity_cat=Non-Hispanic White  0.3951 0.2529  1.56  0.1182  
pov_index                         1.1433 0.1125 10.16  <0.0001 
sbp                               0.0068 0.0053  1.28  0.1999  

Here, we fit our proportional odds logistic regression model using lrm(), where we can see that our Nagelkerke \(R^2\) = 0.338 and the C statistic is 0.780.
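
Because lrm() works on the log-odds scale, exponentiating a coefficient yields an odds ratio. As a quick check using the pov_index coefficient printed above:

```r
b_pov <- 1.1433  # pov_index coefficient from the lrm output above

# odds ratio for a 1-unit increase in the poverty index
exp(b_pov)                     # ~3.14

# odds ratio across a 1.31-to-3.3875 poverty index range
exp(b_pov * (3.3875 - 1.31))   # ~10.75
```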

5.2.9 Multinomial Logistic Regression

Code
# Create Model
mod1_multi <- multinom(food_cat ~ ethnicity_cat + pov_index + sbp,
                       data = a1_train, trace = FALSE)

mod1_multi
Call:
multinom(formula = food_cat ~ ethnicity_cat + pov_index + sbp, 
    data = a1_train, trace = FALSE)

Coefficients:
         (Intercept) ethnicity_catNon-Hispanic Asian
Low        1.4188819                       0.1981612
Marginal   0.4613513                       1.8080048
Full      -1.8119937                       1.2979975
         ethnicity_catNon-Hispanic Black ethnicity_catNon-Hispanic White
Low                         -0.578071453                      -0.1789125
Marginal                     0.466328494                       0.7004697
Full                        -0.009666859                       0.4675812
         pov_index          sbp
Low      0.6100889 -0.015265407
Marginal 0.3793099 -0.008335024
Full     1.5612294  0.002064003

Residual Deviance: 903.5196 
AIC: 939.5196 

Here, we create our multinomial model with the same main effects.

5.2.10 Model Comparison

Code
# likelihood ratio test
lrm_logLik <- logLik(mod1_lrm)[1]
multi_logLik <- logLik(mod1_multi)[1]

# calculate p-value
pchisq(-2 * (lrm_logLik - multi_logLik),
       df = 10, lower.tail = FALSE)
[1] 0.1044315
Code
# AIC values for each model
AIC(mod1_lrm)
[1] 935.3554
Code
AIC(mod1_multi)
[1] 939.5196

To compare the fit of these two models, we used a likelihood ratio test along with a comparison of AIC values. For the likelihood ratio test, we first had to find the degrees of freedom, which is 10: the lrm model estimates 8 parameters and the multinomial model 18, a difference of 10. The resulting p-value of 0.104 indicates no detectable difference in fit between the two models, so we then compared the AIC values. These are also very similar, with the lrm model scoring slightly better. All of this suggests that the proportional odds assumption is reasonable, so we will move forward with the lrm model.
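
As a sanity check, the same likelihood ratio test can be recovered from the two printed AIC values alone, since AIC = 2k - 2*logLik with k = 8 parameters for the lrm fit and k = 18 for the multinomial fit:

```r
# recover each log-likelihood from AIC = 2k - 2*logLik
ll_lrm   <- 8  - 935.3554 / 2  # k = 8 parameters
ll_multi <- 18 - 939.5196 / 2  # k = 18 parameters

# likelihood ratio statistic, df = 18 - 8 = 10
lr_stat <- 2 * (ll_multi - ll_lrm)
pchisq(lr_stat, df = 10, lower.tail = FALSE)  # ~0.104, as above
```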

5.2.11 Final Model - Proportional Odds

Code
mod1_lrm$coefficients
                          y>=Low                      y>=Marginal 
                     -1.05798930                      -2.09757419 
                         y>=Full ethnicity_cat=Non-Hispanic Asian 
                     -3.10205296                       0.70334431 
ethnicity_cat=Non-Hispanic Black ethnicity_cat=Non-Hispanic White 
                      0.11053691                       0.39513215 
                       pov_index                              sbp 
                      1.14328638                       0.00682121 
Code
summary(mod1_lrm)
             Effects              Response : food_cat 

 Factor                                                Low    High     Diff.  
 pov_index                                               1.31   3.3875  2.0775
  Odds Ratio                                             1.31   3.3875  2.0775
 sbp                                                   120.00 138.0000 18.0000
  Odds Ratio                                           120.00 138.0000 18.0000
 ethnicity_cat - Other:Non-Hispanic White                4.00   1.0000      NA
  Odds Ratio                                             4.00   1.0000      NA
 ethnicity_cat - Non-Hispanic Asian:Non-Hispanic White   4.00   2.0000      NA
  Odds Ratio                                             4.00   2.0000      NA
 ethnicity_cat - Non-Hispanic Black:Non-Hispanic White   4.00   3.0000      NA
  Odds Ratio                                             4.00   3.0000      NA
 Effect   S.E.    Lower 0.95 Upper 0.95
  2.37520 0.23374  1.917100   2.83330  
 10.75300      NA  6.800900  17.00100  
  0.12278 0.09578 -0.064943   0.31051  
  1.13060      NA  0.937120   1.36410  
 -0.39513 0.25290 -0.890810   0.10054  
  0.67359      NA  0.410330   1.10580  
  0.30821 0.38855 -0.453330   1.06970  
  1.36100      NA  0.635510   2.91460  
 -0.28460 0.24708 -0.768870   0.19968  
  0.75232      NA  0.463540   1.22100  
Code
plot(summary(mod1_lrm))

As suggested by the coefficients above, pov_index has a very large effect size, while the other variables' odds ratios stay relatively close to 1.

Code
func_low <- function(x) plogis(x - mod1_lrm$coef[1] + mod1_lrm$coef[2])
func_marg <- function(x) plogis(x - mod1_lrm$coef[1] + mod1_lrm$coef[3])
func_full <- function(x) plogis(x - mod1_lrm$coef[1] + mod1_lrm$coef[4])

# nomogram
plot(nomogram(mod1_lrm, fun = list('Pr(Very Low)' = plogis, 
                                   'Pr(Low)' = func_low,
                                   'Pr(Marginal)' = func_marg,
                                   'Pr(Full)' = func_full),
              abbrev = TRUE),
     cex.axis = 0.55)

Code
ggplot(Predict(mod1_lrm, fun = Mean(mod1_lrm, code = TRUE)))

From these mean prediction plots, we can see that the mean predictions for ethnicity_cat and sbp are relatively flat. For ethnicity_cat, the Other category has the smallest average prediction and Non-Hispanic Asian the highest, although the 95% confidence intervals overlap heavily. For sbp, the predicted values are essentially constant across its range. The one variable that shows a large difference is pov_index, particularly for index values below 3, where the mean prediction rises steeply as the index increases; above 3, the mean predictions level off. This demonstrates how influential pov_index is to the predictions.

5.2.12 Model Validation

Code
set.seed(12321)

validate(mod1_lrm)
          index.orig training    test optimism index.corrected  n
Dxy           0.5599   0.5667  0.5538   0.0128          0.5470 40
R2            0.3378   0.3444  0.3301   0.0143          0.3235 40
Intercept     0.0000   0.0000  0.0323  -0.0323          0.0323 40
Slope         1.0000   1.0000  0.9617   0.0383          0.9617 40
Emax          0.0000   0.0000  0.0142   0.0142          0.0142 40
D             0.3550   0.3646  0.3453   0.0193          0.3357 40
U            -0.0040  -0.0040 -0.9286   0.9246         -0.9286 40
Q             0.3590   0.3686  1.2739  -0.9053          1.2643 40
B             0.1507   0.1495  0.1523  -0.0028          0.1535 40
g             1.8676   1.8988  1.8189   0.0800          1.7876 40
gp            0.2168   0.2189  0.2145   0.0044          0.2125 40
Code
0.5470/2 + 0.5
[1] 0.7735

By validating our model with bootstrap resampling (40 replications), we get an optimism-corrected Nagelkerke \(R^2\) = 0.324 and C statistic of 0.774, which is still decent.
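
The conversion used above is the general identity between Somers' Dxy and the C statistic, C = (Dxy + 1) / 2:

```r
dxy_to_c <- function(dxy) (dxy + 1) / 2

dxy_to_c(0.5470)  # 0.7735, the optimism-corrected C statistic
dxy_to_c(0.5599)  # ~0.780, recovering the apparent C from the original Dxy
```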

5.2.13 Test Data

Code
# perform same steps done on training data on test data
a1_test <- test_data |>
  select(ethnicity_cat, pov_index, sbp, food_cat) |>
  impute_pmm(pov_index ~ ethnicity_cat + food_cat) |>
  mutate(ethnicity_cat = fct_collapse(ethnicity_cat,
                                      "Other" = c("Mexican American", 
                                                  "Other Hispanic", 
                                                  "Other Race")))

# make predictions
m1_predictions <- predict(mod1_lrm, 
                          newdata = as.data.frame(a1_test),
                          type = "fitted.ind")

head(m1_predictions)
  food_cat=Very Low food_cat=Low food_cat=Marginal food_cat=Full
1       0.246223345  0.233968980        0.23590947     0.2838982
2       0.004230543  0.007641797        0.01989229     0.9682354
3       0.129550667  0.166671581        0.23850330     0.4652745
4       0.084221616  0.122182582        0.20885662     0.5847392
5       0.180280917  0.203185554        0.24592760     0.3706059
6       0.030229571  0.050784060        0.11299426     0.8059921

We can now move to our test data, where we can get the estimated food security category probabilities for each subject.

Code
# function to find the highest-probability category for each subject
get_pred_cat <- function(category_predictions) {
  labels <- c("Very Low", "Low", "Marginal", "Full")
  
  # for each row, take the label of the column with the largest probability
  # (which.max returns the first maximum, so earlier categories win ties)
  labels[apply(category_predictions, 1, which.max)]
}


preds <- get_pred_cat(m1_predictions)

# show output
head(m1_predictions, 10)
   food_cat=Very Low food_cat=Low food_cat=Marginal food_cat=Full
1        0.246223345  0.233968980        0.23590947     0.2838982
2        0.004230543  0.007641797        0.01989229     0.9682354
3        0.129550667  0.166671581        0.23850330     0.4652745
4        0.084221616  0.122182582        0.20885662     0.5847392
5        0.180280917  0.203185554        0.24592760     0.3706059
6        0.030229571  0.050784060        0.11299426     0.8059921
7        0.148998798  0.182172901        0.24365906     0.4251692
8        0.041957595  0.068247518        0.14251279     0.7472821
9        0.343234626  0.253209908        0.20496878     0.1985867
10       0.029337784  0.049407556        0.11048199     0.8107727
Code
head(preds, 10)
 [1] "Full"     "Full"     "Full"     "Full"     "Full"     "Full"    
 [7] "Full"     "Full"     "Very Low" "Full"    
Code
multi_roc <- a1_test |>
  mutate(pred = preds) |>                  # add predictions
  mutate(pred = as.numeric(case_when(
    pred == "Very Low" ~ 0,                # make values numeric for function
    pred == "Low" ~ 1,
    pred == "Marginal" ~ 2,
    pred == "Full" ~ 3)))

# get ROC
multiclass.roc(multi_roc$food_cat ~ multi_roc$pred, 
               direction = "<")

Call:
multiclass.roc.formula(formula = multi_roc$food_cat ~ multi_roc$pred,     direction = "<")

Data: multi_roc$pred with 4 levels of multi_roc$food_cat: Very Low, Low, Marginal, Full.
Multi-class area under the curve: 0.5958

We can also look at the C statistic for the test data, which isn't great at 0.5958. To get this, we first had to find which food security category was most likely for each subject and add these predictions to our test data; we could then run the multiclass.roc() function.

Code
mod1_polr <- polr(food_cat ~ ethnicity_cat + pov_index + sbp,
                  data = a1_train, Hess = TRUE)

results_1 <- addmargins(table(predict(mod1_polr, newdata = as.data.frame(a1_test)),
                              a1_test$food_cat))

results_1
          
           Very Low Low Marginal Full Sum
  Very Low        9  10        5    8  32
  Low             0   0        0    0   0
  Marginal        0   0        0    0   0
  Full           13  28       25  130 196
  Sum            22  38       30  138 228
Code
(9+130)/228 * 100
[1] 60.96491

By creating a polr model and building a classification table on the test data, we can calculate that the model correctly classified 61.0% of the test observations. We can also see that our model only ever predicts the very low and full categories. This is likely because pov_index is highly influential while the other predictors contribute little, so the model becomes almost dependent on a single predictor.

5.3 Analysis 2

5.3.1 Research Question

How well can we predict the average number of times someone has to pee at night using age, alcohol intake, and systolic blood pressure as predictors?

5.3.2 Data Selection

Code
# subset data
a2_train <- train_data |>
  select(age, avg_drinks, sbp, pee)

Here, we subset our overall data for just the variables needed in analysis 2.

5.3.3 Missingness

Code
# check missingness
miss_var_summary(a2_train)
# A tibble: 4 × 3
  variable   n_miss pct_miss
  <chr>       <int>    <dbl>
1 avg_drinks    205       41
2 age             0        0
3 sbp             0        0
4 pee             0        0

By checking our missing values, we can see that the only variable with missingness is avg_drinks, which is missing 205 values (41%).

5.3.4 Simple Imputation

Code
# imputation - predictive mean matching
a2_train <- a2_train |>
  impute_pmm(avg_drinks ~ age + sbp)

miss_var_summary(a2_train)
# A tibble: 4 × 3
  variable   n_miss pct_miss
  <chr>       <int>    <dbl>
1 age             0        0
2 avg_drinks      0        0
3 sbp             0        0
4 pee             0        0

Due to missingness, we must impute values for avg_drinks. To do so, we used predictive mean matching based on age and sbp.
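As a self-contained illustration of what predictive mean matching does (toy data, not the NHANES variables): PMM fits a prediction model, then imputes each missing value with an observed donor value whose prediction is closest, so imputations always stay within the observed range.

```r
library(simputation)

set.seed(431)
toy <- data.frame(x = rnorm(50, 50, 10))
toy$y <- 2 * toy$x + rnorm(50, 0, 5)
toy$y[sample(50, 10)] <- NA       # introduce missingness in y

# impute y from x via predictive mean matching, as done for avg_drinks above
toy_imp <- impute_pmm(toy, y ~ x)

sum(is.na(toy_imp$y))             # no missing values remain
# every imputed value is an observed donor value from y
all(toy_imp$y[is.na(toy$y)] %in% toy$y[!is.na(toy$y)])
```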

5.3.5 Outcome Distribution

Code
summary(a2_train |> mutate(pee = factor(pee)) |> select(pee))
 pee    
 0: 78  
 1:140  
 2:138  
 3: 91  
 4: 29  
 5: 24  
Code
ggplot(a2_train, aes(x = factor(pee), fill = factor(pee))) +
  geom_bar() +
  labs(title = "Distribution of Pee Breaks per Night",
       x = "Amount of Breaks",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Spectral")

From this graph, we can see that the distribution of pee is unimodal with some right skew, with one and two being the most frequent counts of pee breaks per night.

5.3.6 Outcome Distribution by Predictors

5.3.6.1 Age

Code
ggplot(a2_train, aes(x = factor(pee), y = age, fill = factor(pee))) +
  geom_violin() +
  geom_boxplot(width = 0.2, fill = "White") +
  labs(title = "Age by Pee Breaks per Night",
       x = "Pee Breaks per Night",
       y = "Age") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Spectral") +
  coord_flip()

It appears that for each pee break count except 0, the age distribution is roughly unimodal with varying degrees of left skew. For the 0-breaks group, ages look approximately symmetric, without much skew.

5.3.6.2 Average Alcoholic Drinks per Day

Code
ggplot(a2_train, aes(x = factor(avg_drinks), fill = factor(avg_drinks))) +
  geom_bar() +
  facet_wrap(.~ factor(pee)) +
  labs(title = "Distribution of Alcoholic Drinks per Day by Pee Breaks per Night",
       x = "Amount of Alcoholic Drinks per Day",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Spectral")

From the faceted plot, we can see that for most of the pee break count categories, counts are heavily weighted towards those who have one or two drinks per day.

5.3.6.3 Systolic Blood Pressure

Code
ggplot(a2_train, aes(x = factor(pee), y = sbp, fill = factor(pee))) +
  geom_violin() +
  geom_boxplot(width = 0.2, fill = "White") +
  labs(title = "SBP by Pee Breaks per Night",
       x = "Pee Breaks per Night",
       y = "Systolic Blood Pressure") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Spectral") + coord_flip()

For all the pee break count categories, the systolic blood pressure distributions are roughly normal, with 5 looking relatively uniform. The quartile values for 0-3 are all quite similar, with 4’s values slightly lower and 5 having a wider interquartile range.

5.3.7 Non-Linearity

Code
# build spearman object
mod2_spear <- spearman2(pee ~ age + avg_drinks + sbp, 
                        data = a2_train)

plot(mod2_spear)

Given that our outcome is a count variable and we have 500 observations in our training data, we have a budget of at most 12 degrees of freedom. Based on the Spearman ρ² plot, we will try fitting a 3-knot restricted cubic spline on avg_drinks and a 3-knot restricted cubic spline on age.

Code
mod2_nonlin <- glm(pee ~ rcs(age, 3) + rcs(avg_drinks, 3) + sbp, 
                   data = a2_train, family = "poisson")

anova(mod2_nonlin, test = "Chisq")
Analysis of Deviance Table

Model: poisson, link: log

Response: pee

Terms added sequentially (first to last)

                   Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
NULL                                 499     552.25            
rcs(age, 3)         2   5.7381       497     546.51 0.056752 . 
rcs(avg_drinks, 3)  2  12.7721       495     533.74 0.001685 **
sbp                 1   1.1507       494     532.59 0.283398   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

By testing these new non-linear predictor terms, we can see that avg_drinks with a 3 knot cubic spline is the only term that adds significant predictive value to the model given the other predictors.

5.3.8 Poisson Regression

Code
mod2_poi <- glm(pee ~ rcs(avg_drinks, 3) + age + sbp, 
                data = a2_train, family = "poisson")

tidy(mod2_poi) |> 
  select(term, estimate, std.error, p.value) |> 
  kable(digits = 3)
term                           estimate std.error p.value
(Intercept)                      -0.512     0.336   0.127
rcs(avg_drinks, 3)avg_drinks      0.218     0.090   0.015
rcs(avg_drinks, 3)avg_drinks'    -0.117     0.068   0.088
age                               0.008     0.003   0.013
sbp                               0.002     0.002   0.258
Code
countreg::rootogram(mod2_poi)

We can see from the rootogram of the Poisson regression model that the fit is reasonable, with no count category badly over- or under-predicted. The model slightly over-predicts counts of 0, 1, and 4 and under-predicts counts of 2, 3, and 5.
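Because Poisson coefficients are on the log scale, exponentiating them gives rate ratios for the expected count. For example, using the age coefficient from the table above (a quick arithmetic check, not new model output):

```r
# Expected-count rate ratio for a one-year increase in age,
# holding avg_drinks and sbp fixed (coefficient from the table above)
exp(0.008)
# [1] 1.008032
# i.e., each additional year of age multiplies the expected number
# of nightly pee breaks by about 1.008 (a 0.8% increase).
# For all terms at once, one could run: exp(coef(mod2_poi))
```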

5.3.9 Zero-Inflated Poisson Model

Code
mod2_zip <- pscl::zeroinfl(pee ~ rcs(avg_drinks, 3) + age + sbp,
                           data = a2_train)

summary(mod2_zip)

Call:
pscl::zeroinfl(formula = pee ~ rcs(avg_drinks, 3) + age + sbp, data = a2_train)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.5831 -0.6685 -0.0422  0.6550  2.8310 

Count model coefficients (poisson with log link):
                               Estimate Std. Error z value Pr(>|z|)  
(Intercept)                   -0.462917   0.353785  -1.308   0.1907  
rcs(avg_drinks, 3)avg_drinks   0.192252   0.107785   1.784   0.0745 .
rcs(avg_drinks, 3)avg_drinks' -0.098891   0.079630  -1.242   0.2143  
age                            0.007853   0.003222   2.437   0.0148 *
sbp                            0.002039   0.001804   1.130   0.2583  

Zero-inflation model coefficients (binomial with logit link):
                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                    3.6672460        NaN     NaN      NaN
rcs(avg_drinks, 3)avg_drinks  -7.8566824        NaN     NaN      NaN
rcs(avg_drinks, 3)avg_drinks'  1.1584113        NaN     NaN      NaN
age                            0.0009988  0.0970214   0.010    0.992
sbp                            0.0025882  0.0559863   0.046    0.963
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 43 
Log-likelihood: -813.2 on 10 Df
Code
countreg::rootogram(mod2_zip)

The rootogram of the zero-inflated Poisson model is very similar to that of the plain Poisson model and again fits reasonably well. As before, it slightly over-predicts counts of 0, 1, and 4 and under-predicts counts of 2, 3, and 5. Note also that several zero-inflation coefficients have NaN standard errors, suggesting that the zero-inflation component is poorly identified for these data.

5.3.10 Model Comparison

Code
# Vuong test
vuong(mod2_poi, mod2_zip)
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A p-value
Raw                  -0.2124794 model2 > model1 0.41587
AIC-corrected        11.4681252 model1 > model2 < 2e-16
BIC-corrected        36.0827105 model1 > model2 < 2e-16

As seen before, the rootograms for the two models look nearly identical, indicating that they perform very similarly. To confirm, we ran a Vuong test: the raw statistic mildly favors the zero-inflated model but is far from statistical significance, while the AIC- and BIC-corrected statistics strongly favor the plain Poisson model. With no evidence that the zero-inflation component helps, we will move forward with the simpler Poisson regression model.

5.3.11 Final Model - Poisson Regression

Code
tidy(mod2_poi) |> 
  select(term, estimate, std.error, p.value) |> 
  kable(digits = 3)
term                           estimate std.error p.value
(Intercept)                      -0.512     0.336   0.127
rcs(avg_drinks, 3)avg_drinks      0.218     0.090   0.015
rcs(avg_drinks, 3)avg_drinks'    -0.117     0.068   0.088
age                               0.008     0.003   0.013
sbp                               0.002     0.002   0.258
Code
mod2_aug <- augment(mod2_poi, a2_train, 
                    type.predict = "response")

mets <- metric_set(rsq, rmse, mae)

mets(mod2_aug, truth = pee, estimate = .fitted) |>
  kable(digits = 3)
.metric .estimator .estimate
rsq standard 0.037
rmse standard 1.290
mae standard 1.030

We can now look at how our final model performs on the training data, where we get an \(R^2\) of 0.037, root mean square error of 1.290, and mean absolute error of 1.030.

5.3.12 Model Validation

Code
a2_test <- test_data |>
  select(age, avg_drinks, sbp, pee) |>
  impute_pmm(avg_drinks ~ age + sbp)

mod2_aug_test <- augment(mod2_poi, newdata = a2_test, 
                         type.predict = "response")

mets(mod2_aug_test, truth = pee, estimate = .fitted) |>
  kable(digits = 3)
.metric .estimator .estimate
rsq standard 0.028
rmse standard 1.316
mae standard 1.068

After validating our model with the test data, we get an \(R^2\) of 0.028, root mean square error of 1.316, and mean absolute error of 1.068. While these values are all slightly worse than the training results, they are all very similar, suggesting that our model will hold up and still be effective with new data.

6 Conclusions and Discussion

6.1 Answering My Research Questions

6.1.1 Analysis 1 Conclusion

Our research question was: how well can we predict an adult’s food security category over the past 12 months using ethnicity, poverty index, and systolic blood pressure as predictors? Once we validated our final model on the test data, it became evident that the model wasn’t very robust and didn’t perform very well. We got a multiclass C statistic of 0.5958 and an accuracy of 61.0%, only slightly better than always guessing the most common category (full, 60.5% of the test data). Looking more closely at the predictions, the model only ever assigned subjects to the very low or full categories, ignoring low and marginal. This is likely because pov_index was the only predictor that provided meaningful predictive value, and its coefficient was quite large.

6.1.2 Analysis 1 Discussion

Most of the modeling issues we had were likely due to our small sample size: we had 728 subjects in total and subset that to 500 for training, so the model didn’t have much data to learn from. Additionally, we used simple imputation for missing values, which may not have produced the most representative values; this is probably not a huge factor, though, as there were only 71 missing values in our training data, all from pov_index. To improve on this, we could use multiple imputation to obtain values that better reflect the uncertainty in the missing data. Another limitation was that a couple of ethnicity_cat categories had few observations, so we merged some of them together to make the categories more balanced. We were also limited to 6 degrees of freedom for our predictors, so we didn’t check for possible non-linear terms. Finally, the easiest and most influential improvement would be to select more useful predictor variables. Two of our three predictors (ethnicity and systolic blood pressure) didn’t appear to contribute much to the model, as the mean prediction plots show: varying their values didn’t move the needle in determining a subject’s food security category. Choosing variables more strongly tied to food security should yield better predictions.
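A hedged sketch of the multiple-imputation alternative mentioned above, using the mice package (not loaded in this report’s setup) on toy data: mice() creates several completed datasets, with() fits a model on each, and pool() combines the estimates via Rubin’s rules.

```r
library(mice)

set.seed(431)
toy <- data.frame(x = rnorm(60), z = rnorm(60))
toy$y <- toy$x + rnorm(60)
toy$y[sample(60, 15)] <- NA      # introduce missingness in y

# m = 5 completed datasets, a model fit on each, estimates pooled
imp  <- mice(toy, m = 5, seed = 431, printFlag = FALSE)
fits <- with(imp, lm(y ~ x + z))
summary(pool(fits))
```

For the report itself, the same pattern would replace the simple imputation step: impute a1_train with mice, fit the polr model on each completed dataset, and pool.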

6.1.3 Analysis 2 Conclusion

Our research question was: how well can we predict the average number of times someone has to pee at night using age, alcohol intake, and systolic blood pressure as predictors? The rootogram for our model fit pretty well overall, without much over- or under-prediction for any count value. After validating the Poisson regression model, we can conclude that it will likely hold up on new data: the validated results were very similar to the training results, which points to the model generalizing well to unseen data. That said, the low R² values mean that age, daily alcohol intake, and systolic blood pressure provide only modest power for predicting the average number of nightly pee breaks.

6.1.4 Analysis 2 Discussion

Given how similarly the holdout data performed relative to the training data, there aren’t many caveats about our model’s stability. One concern is that our overall dataset (train plus test) is relatively small (728 subjects), so it may not be fully representative in general. A larger limitation to our model’s validity is our use of imputation: we used simple imputation, which may not be as effective as multiple imputation. More importantly, we were missing 41% of avg_drinks values in our training data and roughly 45% in our test data. With over two-fifths of the values in each dataset imputed, our data may not be very representative of the actual distribution of average daily alcoholic drinks over the past 12 months.

6.1.5 Project Lessons

Throughout this project, I think there are two main things that I learned. The first is working with multi-categorical variables and the types of plots that can be created to help visualize predictions. Second, I learned a lot about using multiple imputation with various models, which I initially tried to use in this project.

7 References and Acknowledgments

7.1 References

Here are links to the datasets used for the project. Note that while they are labeled 2017-2018, the webpage says the data covers 2017-March 2020.

We also looked at the US Department of Agriculture for more details regarding food security.

https://www.ers.usda.gov/topics/food-nutrition-assistance/food-security-in-the-u-s/measurement/

7.2 Acknowledgments

I would like to thank Dr. Thomas Love for his teaching and guidance throughout the past two semesters. I would also like to thank the Case Western Reserve University Department of Population and Quantitative Health Sciences for providing resources for this endeavor.

8 Session Information

Code
xfun::session_info()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Locale: en_US.UTF-8 / en_US.UTF-8 / en_US.UTF-8 / C / en_US.UTF-8 / en_US.UTF-8

Package version:
  abind_1.4.5          askpass_1.1          backports_1.4.1     
  base64enc_0.1-3      bit_4.0.5            bit64_4.0.5         
  blob_1.2.3           boot_1.3.28.1        brio_1.1.3          
  broom_1.0.3          bslib_0.4.2          cachem_1.0.7        
  callr_3.7.3          car_3.1.1            carData_3.0.5       
  cellranger_1.1.0     checkmate_2.1.0      class_7.3.21        
  cli_3.6.0            clipr_0.8.0          cluster_2.1.4       
  codetools_0.2-19     colorspace_2.1-0     compiler_4.2.2      
  conflicted_1.2.0     countreg_0.2-1       cpp11_0.4.3         
  crayon_1.5.2         curl_5.0.0           data.table_1.14.8   
  datasets_4.2.2       DBI_1.1.3            dbplyr_2.3.1        
  deldir_1.0-6         DEoptimR_1.0.11      desc_1.4.2          
  diffobj_0.3.5        digest_0.6.31        doRNG_1.8.6         
  dplyr_1.1.0          dtplyr_1.3.0         e1071_1.7.13        
  ellipsis_0.3.2       evaluate_0.20        fansi_1.0.4         
  farver_2.1.1         fastmap_1.1.1        forcats_1.0.0       
  foreach_1.5.2        foreign_0.8-84       Formula_1.2-5       
  fs_1.6.1             gargle_1.3.0         generics_0.1.3      
  ggplot2_3.4.1        glmnet_4.1.6         glue_1.6.2          
  googledrive_2.0.0    googlesheets4_1.0.1  gower_1.0.1         
  graphics_4.2.2       grDevices_4.2.2      grid_4.2.2          
  gridExtra_2.3        gtable_0.3.1         hardhat_1.2.0       
  haven_2.5.1          highr_0.10           Hmisc_4.8-0         
  hms_1.1.2            htmlTable_2.4.1      htmltools_0.5.4     
  htmlwidgets_1.6.1    httr_1.4.5           ids_1.0.1           
  interp_1.1-3         isoband_0.2.7        iterators_1.0.14    
  itertools_0.1.3      janitor_2.2.0        jpeg_0.1-10         
  jquerylib_0.1.4      jsonlite_1.8.4       kableExtra_1.3.4    
  knitr_1.42           labeling_0.4.2       laeken_0.5.2        
  lattice_0.20-45      latticeExtra_0.6-30  lifecycle_1.0.3     
  lme4_1.1.31          lmtest_0.9.40        lubridate_1.9.2     
  magrittr_2.0.3       MASS_7.3-58.2        Matrix_1.5-3        
  MatrixModels_0.5-1   memoise_2.0.1        methods_4.2.2       
  mgcv_1.8.41          mime_0.12            minqa_1.2.5         
  missForest_1.5       modelr_0.1.10        multcomp_1.4-22     
  munsell_0.5.0        mvtnorm_1.1-3        naniar_1.0.0        
  nhanesA_0.7.2        nlme_3.1-162         nloptr_2.0.3        
  nnet_7.3-18          norm_1.0.10.0        numDeriv_2016.8.1.1 
  openssl_2.0.5        parallel_4.2.2       pbkrtest_0.5.2      
  pillar_1.8.1         pkgconfig_2.0.3      pkgload_1.3.2       
  plyr_1.8.8           png_0.1-8            polspline_1.1.22    
  praise_1.0.0         prettyunits_1.1.1    pROC_1.18.0         
  processx_3.8.0       progress_1.2.2       proxy_0.4.27        
  ps_1.7.2             pscl_1.5.5           purrr_1.0.1         
  quantreg_5.94        R6_2.5.1             ragg_1.2.5          
  randomForest_4.7.1.1 ranger_0.14.1        rappdirs_0.3.3      
  RColorBrewer_1.1-3   Rcpp_1.0.10          RcppEigen_0.3.3.9.3 
  readr_2.1.4          readxl_1.4.2         rematch_1.0.1       
  rematch2_2.1.2       reprex_2.0.2         rlang_1.1.0         
  rmarkdown_2.20       rms_6.5-0            rngtools_1.5.2      
  robustbase_0.95.0    rpart_4.1.19         rprojroot_2.0.3     
  rstudioapi_0.14      rvest_1.0.3          sandwich_3.0-2      
  sass_0.4.5           scales_1.2.1         selectr_0.4.2       
  shape_1.4.6          simputation_0.2.8    snakecase_0.11.0    
  sp_1.6.0             SparseM_1.81         splines_4.2.2       
  stats_4.2.2          stringi_1.7.12       stringr_1.5.0       
  survival_3.5-3       svglite_2.1.1        sys_3.4.1           
  systemfonts_1.0.4    testthat_3.1.6       textshaping_0.3.6   
  TH.data_1.1-1        tibble_3.1.8         tidyr_1.3.0         
  tidyselect_1.2.0     tidyverse_2.0.0      timechange_0.2.0    
  tinytex_0.44         tools_4.2.2          tzdb_0.3.0          
  UpSetR_1.4.0         utf8_1.2.3           utils_4.2.2         
  uuid_1.1.0           vcd_1.4.11           vctrs_0.6.1         
  VIM_6.2.2            viridis_0.6.2        viridisLite_0.4.1   
  visdat_0.6.0         vroom_1.6.1          waldo_0.4.0         
  webshot_0.5.4        withr_2.5.0          xfun_0.37           
  xml2_1.3.3           yaml_2.3.7           yardstick_1.1.0     
  zoo_1.8-11